Nonparametric Bayesian Biclustering
Authors: Edward Meeds and Sam Roweis
Abstract
We present a probabilistic block-constant biclustering model that simultaneously clusters the rows and columns of a data matrix. All entries sharing the same row cluster and column cluster form a bicluster. Each cluster is part of a mixture with a nonparametric Bayesian prior, so the number of biclusters is treated as a nuisance parameter and implicitly integrated over during simulation. Missing entries are integrated out of the model entirely, which both bypasses the common requirement of biclustering algorithms that missing values be filled in before analysis and makes the model robust to high rates of missingness. Using a Gaussian model for the density of entries in biclusters yields an efficient sampling algorithm, because the bicluster parameters can be analytically integrated out. We present several inference procedures for sampling cluster indicators, including Gibbs and split-merge moves. We show that our method is competitive with, if not superior to, existing imputation methods, especially at high missing rates, despite imputing constant values for entire blocks of data. We present imputation experiments and exploratory biclustering results.

Edward Meeds and Sam Roweis, Department of Computer Science, University of Toronto

1 Biclustering

Biclustering (also known as co-clustering or two-way clustering) refers to the simultaneous grouping of rows and columns of a data matrix. Each bicluster is a submatrix of the full (possibly reordered) data matrix, and the entries in a bicluster should have some coherent structure (the details of which depend on the method employed). This coherence could be, for example, constant values for all entries in the submatrix, or similar row/column patterns within a bicluster. Biclustering algorithms are also characterized by how rows and columns are assigned to clusters.
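To make the block-constant structure concrete, here is a minimal sketch (not from the paper) of how row and column cluster labels jointly induce biclusters: the bicluster of entry (i, j) is the pair (row label of i, column label of j), and a block-constant model summarizes each block by a single value. The toy data, labels, and `block_means` helper are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: a 6x4 matrix with block-constant structure plus noise.
rng = np.random.default_rng(0)
row_labels = np.array([0, 0, 1, 1, 1, 2])   # each row belongs to exactly one cluster
col_labels = np.array([0, 1, 1, 0])          # each column belongs to exactly one cluster
means = np.array([[1.0, 5.0],                # one value per (row cluster, column cluster)
                  [3.0, 2.0],
                  [4.0, 4.0]])
X = means[row_labels][:, col_labels] + 0.1 * rng.standard_normal((6, 4))

def block_means(X, row_labels, col_labels):
    """Summarize each bicluster (k, l) by the mean of its entries."""
    K, L = row_labels.max() + 1, col_labels.max() + 1
    M = np.zeros((K, L))
    for k in range(K):
        for l in range(L):
            # np.ix_ selects the submatrix of rows in cluster k and columns in cluster l
            M[k, l] = X[np.ix_(row_labels == k, col_labels == l)].mean()
    return M

print(block_means(X, row_labels, col_labels))  # close to `means` up to noise
```

Because every row and column carries a single label, the blocks tile the (reordered) matrix exactly, matching the non-overlapping setting this paper assumes.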
Rows/columns can either belong to multiple clusters (as shown in Figure 2A & 2B) or to only a single cluster (as shown in 2C); clusters can overlap (2A) or not (2B & 2C). Some matrix entries may also belong to a “background” noise model that is not part of any bicluster (2A & 2B). Most representations assume that there exists a single permutation of the matrix rows/columns after which all the biclusters are contiguous blocks. (Matrix tile analysis [4] is an exception.) Our approach produces biclusters like those in Figure 2C: each row and column belongs to a single, non-overlapping cluster.

Figure 1: Left: Original data. Right: Data after biclustering.

Assessing the significance of partitions discovered by biclustering is problematic for several reasons. First, there are few available data sets annotated with ground-truth partitions. Second, those that are annotated may have partitions that do not correspond to any possible result of the clustering algorithm. Third, most algorithms have parameters that modify the scale/size of the partitions discovered, and deciding which scale is best in a purely unsupervised manner is difficult and poorly defined. For example, a common goal when clustering microarray data is to group genes and/or experiments in such a way that the partitions are biologically “significant” or “plausible”; often this is assessed by examining the clusters by hand [2, 8]. Several modeling issues must be addressed by any biclustering method. The most important is a method for assessing when a partition represents a significant bicluster; this is closely related to the choice of the number of clusters. Most biclustering algorithms use greedy procedures to fit biclusters one at a time until either a global fitting objective or a pre-specified number of clusters has been reached [2, 8].
Of course, if the only objective is to reduce some measure of fitting residual, overfitting will occur unless the model is heavily regularized, especially in very flexible non-probabilistic models. By restricting the permissible types of clusters we can control capacity; we can also use a probabilistic model of the data and let the marginal likelihood be our guide. The biclustering algorithm we present here is a fully probabilistic model which places Bayesian nonparametric priors over row and column clusters. This allows us to treat the number of biclusters as a nuisance parameter and implicitly integrate it out.
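The mechanism behind "integrating out" the number of clusters can be sketched with the Chinese restaurant process representation commonly used for Dirichlet process mixtures (an illustrative sketch, not the paper's full sampler; the function name and example sizes are assumptions). Under a CRP prior with concentration alpha, the prior probability that a row joins an existing cluster, or opens a new one, depends only on current cluster sizes, so no fixed number of clusters is ever specified:

```python
import numpy as np

def crp_prior_probs(cluster_sizes, alpha):
    """Prior probability that a new item joins each existing cluster,
    or (last slot) starts a brand-new cluster, under a CRP(alpha) prior."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    probs = np.append(sizes, alpha)   # existing clusters weighted by size; new cluster by alpha
    return probs / probs.sum()        # normalize over the K+1 options

# Example: three existing clusters of sizes 4, 2, 1 and alpha = 1.0,
# giving probabilities proportional to 4, 2, 1, and 1 (new cluster).
print(crp_prior_probs([4, 2, 1], alpha=1.0))
```

In a Gibbs sweep these prior weights are multiplied by the (collapsed) predictive likelihood of the row's data under each candidate cluster, which is what makes the number of clusters adapt to the data rather than being set in advance.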
Publication date: 2007